Optionally disable logging in the data sampler to support predict_step #10127
Conversation
raise ValueError("non-None value not found")

def get_dtype_device(torch_object) -> Tuple[torch.dtype, torch.device]:  # noqa: D103

Code scanning / CodeQL notice: Explicit returns mixed with implicit (fall-through) returns.
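To address the CodeQL note, every branch should return or raise explicitly rather than falling through with an implicit None. Below is a minimal sketch of what such a helper could look like; the recursive search through nested containers is an assumption inferred from the function's name and the `raise ValueError` line in the diff, not the actual implementation under review.

```python
from typing import Tuple

import torch


def get_dtype_device(torch_object) -> Tuple[torch.dtype, torch.device]:
    """Return the dtype and device of the first tensor found in a nested structure.

    Sketch only: every code path returns or raises explicitly, which is
    what the CodeQL "explicit/implicit returns" notice asks for.
    """
    if isinstance(torch_object, torch.Tensor):
        return torch_object.dtype, torch_object.device
    if isinstance(torch_object, dict):
        for value in torch_object.values():
            try:
                return get_dtype_device(value)
            except ValueError:
                continue  # keep searching the remaining values
        raise ValueError("non-None value not found")
    if isinstance(torch_object, (list, tuple)):
        for value in torch_object:
            try:
                return get_dtype_device(value)
            except ValueError:
                continue
        raise ValueError("non-None value not found")
    # Anything else (including None) carries no dtype/device information.
    raise ValueError("non-None value not found")
```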
# NOTE(SKH): These types are all wrong, but are close. The inner type must always be a torch.Tensor, but the outer container should be generic.
def batch_collator(batches: Optional[Union[Tuple[ReductionT], List[ReductionT]]]) -> Optional[ReductionT]:

Code scanning / CodeQL notice: Explicit returns mixed with implicit (fall-through) returns.
    case [list(), *_]:
        return [batch_collator([batch[i] for batch in batches]) for i in range(len(batches[0]))]
    case None:
        return None

Code scanning / CodeQL warning: Unreachable code.
Stopping this for now until our CI is out of maintenance.
Looks good, thanks! Can we clean up some of the old comments in the test script?
LGTM! Thanks!
Commit message (NVIDIA#10127):
* Resolve merge conflicts with consumed sample logging
* Add test file that captures the predict step error
* Add fixme comment around proper checkpoint nemo2 handling
* Skip megatron training test on CPU nodes
* Move output_log to last arg for compatibility
* Try setting the default root dir in predict to avoid writing artifacts to cwd
* Handle the new check for batch samplers to enable predict_step
* Only reset the global microbatch, not entire parallel state
* Destroy the right sets of state in test of lightning trainer
* Fix typo and rename state resetting functions
* Run test in a subprocess to avoid contaminating global state

Signed-off-by: John St John <jstjohn@nvidia.com>
PyTorch Lightning's predict_step does not support logging on the module. This PR adds an option to the data sampler to disable logging, which allows the predict step to work.
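The shape of the fix can be sketched without Lightning: gate every logging call behind a flag so the sampler can be constructed with logging disabled for prediction. The class and method names below are hypothetical stand-ins, not the actual NeMo sampler API.

```python
class DataSampler:
    """Minimal sketch of a data sampler whose logging can be switched off.

    In the real setup, logging would go through the LightningModule's
    ``self.log``, which raises inside ``predict_step``; constructing the
    sampler with ``output_log=False`` avoids that call path entirely.
    All names here are illustrative, not the NeMo API.
    """

    def __init__(self, output_log: bool = True):
        self.output_log = output_log
        self.logged = []  # stands in for metrics sent to the module's logger

    def on_step_end(self, module, consumed_samples: int) -> None:
        if self.output_log:
            # Real code would call module.log("consumed_samples", ...) here.
            self.logged.append(("consumed_samples", consumed_samples))


# During prediction, construct the sampler with logging disabled:
predict_sampler = DataSampler(output_log=False)
```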